
    Eine statistische Methode zur Erkennung von Dokumentstrukturen (A Statistical Method for Recognizing Document Structures)

    This PhD thesis is on the topic of document recognition, focusing in particular on the learning of document models and the recognition of the logical structure of documents. To achieve both high reliability and user friendliness, we describe an interactive system that can easily be adapted to new document classes. In an initial learning session, the system generates a recognition model from a small set of completely tagged logical documents. In subsequent recognition sessions, the user interactively corrects the recognition errors of the system. To prevent the system from repeating the same errors, these corrections are automatically integrated into the model through the system's incremental learning capabilities. The document model is represented with a novel statistical formalism based on n-grams, which have been generalized so that they can represent tree structures. The basic principle is to represent local patterns in tree structures by the conditional probabilities of n-grams. Such a statistical model is able to represent one document class at a time. In discussing the expressiveness of the statistical model, we introduce the notion of the entropy of a model. We further introduce a learning algorithm that estimates the n-gram probabilities of the model from a set of sample documents; the same algorithm is used again in the incremental learning steps. The recognition of the physical structure of a document relies on classical methods documented in the literature. The logical structure tree is then constructed stepwise on top of the physical structure by a heuristic bottom-up procedure, which finds the optimal solution efficiently using a quality measure and a best-first search strategy. The approach has been empirically validated on three different document classes, the main test series consisting of 25 documents from an article collection of average structural complexity, totalling 400 pages. The tests revealed that the recognition rate of the system steadily improves with the number of recognized documents; by the end of this training and recognition phase, about one correction is necessary every four pages. Finally, possibilities for integrating the statistical n-gram model with existing standards such as SGML/DSSSL are discussed. For this purpose, a method that translates a statistical model into a corresponding DTD is described.
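
    To make the generalized n-gram idea concrete, here is a minimal Python sketch, not the thesis's actual implementation: node labels in a logical structure tree are predicted from a small local context (here the parent label and the immediate left sibling), the conditional probabilities are estimated by counting over sample trees, and a simple per-context entropy is computed. The function names, the toy document class and the exact choice of context are illustrative assumptions.

```python
# Sketch of a tree-generalized n-gram model: P(label | parent, left sibling),
# estimated by counting over sample trees. Illustrative only.
from collections import defaultdict
import math

def contexts(tree):
    """Yield (context, label) pairs; a tree is (label, [children])."""
    label, children = tree
    prev = "<start>"
    for child in children:
        yield (label, prev), child[0]  # context = (parent label, left sibling)
        prev = child[0]
        yield from contexts(child)

def estimate(samples):
    """Estimate conditional probabilities from a set of sample trees."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in samples:
        for ctx, label in contexts(tree):
            counts[ctx][label] += 1
    return {ctx: {l: n / sum(d.values()) for l, n in d.items()}
            for ctx, d in counts.items()}

def entropy(model):
    """Average per-context entropy, one rough notion of model entropy."""
    hs = [-sum(p * math.log2(p) for p in d.values()) for d in model.values()]
    return sum(hs) / len(hs)

# toy example: two article-like logical structures
doc1 = ("article", [("title", []), ("section", [("para", []), ("para", [])])])
doc2 = ("article", [("title", []), ("section", [("para", [])])])
model = estimate([doc1, doc2])
print(model[("article", "title")])  # {'section': 1.0}
print(entropy(model))               # 0.0 for this deterministic toy data
```

    Because such a model is just a table of counts, re-running the counting step on user-corrected trees is one natural way to realize the incremental learning described above.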

    Reconnaissance de documents assistée: architecture logicielle et intégration de savoir-faire (Assisted Document Recognition: Software Architecture and Integration of Know-How)

    This thesis addresses the question of document recognition from an assisted perspective, advocating an appropriate combination of human and machine capabilities. Our contributions tackle various aspects of the underlying software architecture. Both a study of existing systems and a projection onto future applications of document recognition illustrate the need for cooperative environments. Several mechanisms are proposed to drive the human-machine dialogue and to enable recognition systems to improve with use. The various data involved in a recognition system are organized in a modular and homogeneous way, and the whole of the information is represented using the DAFS standard format. In our proposition, control is decentralized according to a multi-agent model. This conceptual scheme is then simulated on our development platform, using concurrent, distributed, and multi-language programming. An expressive solution is proposed for coupling the application kernel with a graphical user interface, and a prototype was built to validate the whole architecture. Our software architecture takes advantage of typographical know-how through a standardized font-management facility. This integrated approach enhances ergonomics, extends the possible uses of the recognition results, and allows some recognition techniques to be redefined. A few innovative analyzers are described for optical character recognition, font identification, and segmentation; first experiments show that these simple methods behave surprisingly well with respect to what can be expected from the state of the art. In addition, we contribute to the problem of measuring the performance of cooperative recognition systems by introducing a new cost model. Its notation can describe assisted recognition scenarios in which the user takes part in the process and the accuracy changes dynamically thanks to incremental learning. The cost model is used both in simulations and in experiments involving existing analyzers; the observations highlight the particular dynamics of assisted systems compared with fully automatic approaches.
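
    As a rough illustration of the kind of cost model introduced above, the sketch below compares an assisted system, whose error rate decays as user corrections feed incremental learning, with a static automatic baseline followed by manual verification. The exponential decay law and all constants are our assumptions, not the thesis's actual model.

```python
# Toy cost model for assisted recognition: the per-page error rate shrinks
# as corrections are learned, so total operator cost grows sub-linearly.
def assisted_cost(pages, errors_per_page, cost_per_fix, learn_rate):
    total, rate = 0.0, errors_per_page
    for _ in range(pages):
        total += rate * cost_per_fix  # user fixes this page's errors
        rate *= (1.0 - learn_rate)    # learning lowers future error rates
    return total

def automatic_cost(pages, errors_per_page, cost_per_fix):
    # static system: same initial error rate, no improvement with use
    return pages * errors_per_page * cost_per_fix

print(assisted_cost(400, 2.0, 1.0, 0.01))  # ~196: improves with use
print(automatic_cost(400, 2.0, 1.0))       # 800: stays constant
```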

    Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks

    In this paper, we present a novel approach to layer-wise weight initialization of deep neural networks using Linear Discriminant Analysis (LDA). Typically, the weights of a deep neural network are initialized with random values, by greedy layer-wise pre-training (usually as a Deep Belief Network or as an auto-encoder), or by re-using the layers of another network (transfer learning). Hence, many training epochs are needed before meaningful weights are learned, or a rather similar dataset is required to seed the fine-tuning in transfer learning. In this paper, we describe how to turn an LDA into either a neural layer or a classification layer. We analyze the initialization technique on historical documents. First, we show that LDA-based initialization is quick and leads to very stable initial weights. Furthermore, for the task of layout analysis at pixel level, we investigate its effectiveness and show that it outperforms state-of-the-art random weight initialization methods.
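
    The core idea can be sketched in a few lines, assuming scikit-learn and PyTorch: fit an LDA on training features and copy its projection into a linear layer as the initial weights instead of drawing them at random. The toy data, shapes and names below are illustrative; the paper's actual pipeline (layer-wise initialization of a deep network on document image patches) is more involved.

```python
# Initialize a linear layer from a fitted LDA instead of random weights.
import numpy as np
import torch
import torch.nn as nn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_initialized_layer(X, y, n_components):
    lda = LinearDiscriminantAnalysis(n_components=n_components).fit(X, y)
    layer = nn.Linear(X.shape[1], n_components)
    with torch.no_grad():
        # sklearn's LDA transform is (X - xbar_) @ scalings_;
        # fold the centering into the bias of the layer
        S = lda.scalings_[:, :n_components]
        layer.weight.copy_(torch.tensor(S.T, dtype=torch.float32))
        layer.bias.copy_(torch.tensor(-lda.xbar_ @ S, dtype=torch.float32))
    return layer

# toy usage: 3-class data with 10 features -> at most 2 LDA components
X = np.random.randn(300, 10)
y = np.random.randint(0, 3, 300)
layer = lda_initialized_layer(X, y, 2)
print(layer(torch.tensor(X, dtype=torch.float32)).shape)  # torch.Size([300, 2])
```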

    DocMIR: An automatic document-based indexing system for meeting retrieval

    This paper describes the DocMIR system, which automatically captures, analyzes and indexes meetings, conferences, lectures, etc. by taking advantage of the documents projected during the events (e.g. slideshows, budget tables, figures). For instance, the system can apply these procedures to a lecture and automatically index the event according to the presented slides and their contents. For indexing, the system requires neither specific software installed on the presenter's computer nor any conscious intervention by the speaker throughout the presentation. The only material required is the speaker's electronic presentation file; even if it is not provided, the system still temporally segments the presentation and offers a simple storyboard-like browsing interface. The system runs on several capture boxes connected to cameras and microphones that record events synchronously. Once the recording is over, indexing is performed automatically by analyzing the content of the captured video of the projected documents: the system detects scene changes, identifies the documents, computes their durations and extracts their textual content. Each captured image is identified against a repository containing all original electronic documents, the captured audio-visual data and the metadata created during post-production. The identification is based on document signatures, which hierarchically combine features from both the layout structure and the color distribution of the document images. Video segments are finally enriched with the textual content of the identified original documents, which facilitates query and retrieval without using OCR. The signature-based indexing method proposed in this article is robust, works with low-resolution images, and can be applied to several other applications, including real-time document recognition, multimedia information retrieval and augmented reality systems.
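
    As a rough sketch of what such a signature might look like, the Python below combines a coarse layout grid with per-channel color histograms and matches a captured frame to the nearest original by Euclidean distance. The concrete features, grid size and distance measure are illustrative assumptions, not the actual DocMIR signature.

```python
# Toy document signature: coarse grayscale layout grid + color histograms.
import numpy as np

def signature(img, grid=(8, 8), bins=8):
    """img: HxWx3 uint8 array. Returns a 1-D feature vector."""
    gray = img.mean(axis=2)
    h, w = gray.shape
    # coarse layout: mean intensity per grid cell (robust to low resolution)
    layout = np.array([gray[i*h//grid[0]:(i+1)*h//grid[0],
                            j*w//grid[1]:(j+1)*w//grid[1]].mean()
                       for i in range(grid[0]) for j in range(grid[1])])
    # global color distribution, normalized over all three channels
    hist = np.concatenate([np.histogram(img[..., c], bins=bins,
                                        range=(0, 255))[0] for c in range(3)])
    return np.concatenate([layout / 255.0, hist / hist.sum()])

def identify(frame, repository):
    """repository: {doc_id: signature}. Returns the closest document id."""
    sig = signature(frame)
    return min(repository, key=lambda d: np.linalg.norm(repository[d] - sig))

# toy usage: index three random "slides", then re-identify the second one
rng = np.random.default_rng(0)
slides = [rng.integers(0, 256, (120, 160, 3), dtype=np.uint8) for _ in range(3)]
repo = {f"slide{i}": signature(s) for i, s in enumerate(slides)}
print(identify(slides[1], repo))  # -> 'slide1'
```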

    Description languages for multimodal interaction: a set of guidelines and its illustration with SMUIML

    This article introduces the problem of modeling multimodal interaction in the form of markup languages. After an analysis of the current state of the art in multimodal interaction description languages, nine guidelines for languages dedicated to multimodal interaction description are introduced, as well as four different roles that such languages should target: communication, configuration, teaching and modeling. The article then presents the SMUIML language, our proposed solution for improving the time-synchronicity aspect while still fulfilling the other guidelines. SMUIML is finally mapped to these guidelines as a way to evaluate their coverage and to sketch future work.

    A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis

    Automatic analysis of scanned historical documents comprises a wide range of image analysis tasks, which are often challenging for machine learning due to a lack of human-annotated learning samples. With the advent of deep neural networks, a promising way to cope with the lack of training data is to pre-train models on images from a different domain and then fine-tune them on historical documents. In current research, a typical example of such cross-domain transfer learning is the use of neural networks pre-trained on the ImageNet database for object recognition. It remains a mostly open question whether this pre-training helps to analyse historical documents, which have fundamentally different image properties from those of ImageNet. In this paper, we present a comprehensive empirical survey of the effect of ImageNet pre-training on diverse historical document analysis tasks, including character recognition, style classification, manuscript dating, semantic segmentation, and content-based retrieval. While we obtain mixed results for semantic segmentation at the pixel level, we observe a clear trend across different network architectures: ImageNet pre-training has a positive effect on classification as well as content-based retrieval.
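
    The transfer-learning setup studied here follows the standard pattern, sketched below with PyTorch and a recent torchvision: load an ImageNet-pretrained backbone, replace its classification head for the document task, and fine-tune, typically with a smaller learning rate for the pretrained layers. The class count and learning rates are placeholders, not the paper's settings.

```python
# Fine-tune an ImageNet-pretrained ResNet on a document classification task.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 12  # e.g. manuscript styles; placeholder value

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new task head

# smaller learning rate for pretrained layers, larger for the new head
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc")], "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
], momentum=0.9)
```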